Exploratory Data Analysis (EDA)

Project name: E-commerce products


We focus on the Web Data Commons - Training Dataset and Gold Standard for Large-Scale Product Matching dataset (WDC for short), prepared by the staff of the University of Mannheim (Primpeli et al., 2019).

Additionally, each offer is linked to a specific product (cluster id) and contains textual attributes such as title, description, etc.

The training datasets are available in different sizes, varying from small to extra large. In every dataset, the ratio between positive and negative pairs is 1:3.

The proportions between the different collections within one category are as follows:


Reading the data

The data is provided as four files, one per category: Computers, Cameras, Watches, and Shoes. This division is convenient because we plan to train a separate model for each category, so no additional merging is required.
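As a rough illustration of the loading step, the sketch below reads each category into its own DataFrame. The file names are hypothetical placeholders, and the gzip-compressed JSON Lines format is an assumption about how the files were downloaded; adjust both to match the actual files.

```python
import pandas as pd

# Hypothetical file names; substitute the real paths of the downloaded files.
CATEGORY_FILES = {
    "computers": "computers_train.json.gz",
    "cameras": "cameras_train.json.gz",
    "watches": "watches_train.json.gz",
    "shoes": "shoes_train.json.gz",
}


def load_category(path):
    """Read one category file, assuming gzip-compressed JSON Lines (one pair per line)."""
    return pd.read_json(path, lines=True, compression="gzip")


def load_all(files=CATEGORY_FILES):
    """Return a dict mapping category name to its own DataFrame (no merging)."""
    return {name: load_category(path) for name, path in files.items()}
```

Keeping the categories in a dict rather than one concatenated frame mirrors the plan to train one model per category.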

Feature and observation meaning

Each observation is a pair of offers and a label indicating whether the two offers are for the same product (a positive pair) or not (a negative pair). Even in a negative pair, both offers belong to the same category (but to different clusters/products).

Feature meaning:

id: Unique integer identifier of an offer.
cluster_id: The integer ID of the cluster (product) to which an offer belongs.
identifiers: A list of all identifier values that were assigned to an offer together with the schema.org terms that were used to annotate the values.
category: One of 25 product categories the product was assigned to, NaN if not part of the English subset.
title: The product title.
description: The product description.
brand: The product brand.
price: The product price.
specTableContent: The specification table content of the product's website as one string.
keyValuePairs: The key-value pairs that were extracted from the specification tables.

Note: the 'right' suffix represents the first offer in a pair, the 'left' suffix represents the other offer.

A positive pair: cluster_id_left == cluster_id_right
A negative pair: cluster_id_left != cluster_id_right
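The pair definition above can be checked directly on the data. The following is a minimal sketch: the column names `cluster_id_left` and `cluster_id_right` follow the feature description, while `is_positive` and the helper names are hypothetical and introduced only for illustration.

```python
import pandas as pd


def add_pair_label(pairs):
    """Return a copy with a boolean column that is True for positive pairs."""
    out = pairs.copy()
    # Hypothetical column name; a pair is positive when both offers share a cluster.
    out["is_positive"] = out["cluster_id_left"] == out["cluster_id_right"]
    return out


def positive_negative_counts(pairs):
    """Count positive and negative pairs, e.g. to compare against the stated 1:3 ratio."""
    labelled = add_pair_label(pairs)
    n_pos = int(labelled["is_positive"].sum())
    n_neg = len(labelled) - n_pos
    return n_pos, n_neg


# Toy data, not taken from the WDC files.
toy = pd.DataFrame(
    {"cluster_id_left": [10, 10, 11, 12], "cluster_id_right": [10, 20, 30, 40]}
)
print(positive_negative_counts(toy))  # (1, 3)
```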


Computers

Missing values (right, and separately left)

We plan to concatenate the features into a single text that serves as the model input, so missing values are manageable.
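One possible way to build such a text is sketched below. It simply skips missing values when joining, which is why they do not break the input. The feature list, the space separator, and the helper name `concat_features` are assumptions made for illustration, not the final preprocessing.

```python
import pandas as pd

# Assumed set of features to join; the actual selection may differ.
TEXT_FEATURES = ["title", "description", "brand", "price", "specTableContent"]


def concat_features(pairs, side, features=TEXT_FEATURES, sep=" "):
    """Join the non-missing features of one offer ('left' or 'right') into one string."""
    cols = [f"{name}_{side}" for name in features]

    def join_row(row):
        # Missing values (None/NaN) are dropped instead of becoming the text "nan".
        return sep.join(str(value) for value in row if pd.notna(value))

    return pairs[cols].apply(join_row, axis=1)
```

For example, `concat_features(df, "left")` would return one string per row for the left offer; an offer with only a title yields just that title.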

Title

Description

Brand

Price

specTableContent

Missing values (right and left simultaneously)

Number of values per feature that are missing for both the right and the left offer simultaneously.
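A count of this kind could be computed as in the sketch below, which reports the per-side counts next to the simultaneous count for comparison. The function name and the output column names are hypothetical, and only None/NaN is treated as missing (empty strings are not).

```python
import pandas as pd

# Features inspected in this section.
FEATURES = ["title", "description", "brand", "price", "specTableContent"]


def missing_summary(pairs, features=FEATURES):
    """Count missing values per feature: left only, right only, and both at once."""
    rows = {}
    for name in features:
        left_na = pairs[f"{name}_left"].isna()
        right_na = pairs[f"{name}_right"].isna()
        rows[name] = {
            "missing_left": int(left_na.sum()),
            "missing_right": int(right_na.sum()),
            # A row counts here only when the feature is absent in both offers.
            "missing_both": int((left_na & right_na).sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")
```

The same function applies unchanged to each of the four category frames.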

Average length of title and description
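As an illustration, the average lengths could be obtained as follows. This sketch measures length in characters, which is an assumption (a word or token count may be more relevant for the models), and the helper name `average_lengths` is hypothetical.

```python
import pandas as pd


def average_lengths(pairs, features=("title", "description")):
    """Mean character length of each text feature, separately for left and right offers."""
    result = {}
    for name in features:
        for side in ("left", "right"):
            col = f"{name}_{side}"
            # .str.len() yields NaN for missing values, and mean() skips them.
            result[col] = pairs[col].str.len().mean()
    return pd.Series(result)
```

Missing titles or descriptions are excluded from the mean rather than counted as length zero.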


Cameras

Missing values (right, and separately left)

We plan to concatenate the features into a single text that serves as the model input, so missing values are manageable.

Title

Description

Brand

Price

specTableContent

Missing values (right and left simultaneously)

Number of values per feature that are missing for both the right and the left offer simultaneously.

Average length of title and description


Shoes

Missing values (right, and separately left)

We plan to concatenate the features into a single text that serves as the model input, so missing values are manageable.

Title

Description

Brand

Price

specTableContent

Missing values (right and left simultaneously)

Number of values per feature that are missing for both the right and the left offer simultaneously.

Average length of title and description


Watches

Missing values (right, and separately left)

We plan to concatenate the features into a single text that serves as the model input, so missing values are manageable.

Title

Description

Brand

Price

specTableContent

Missing values (right and left simultaneously)

Number of values per feature that are missing for both the right and the left offer simultaneously.

Average length of title and description